DETAILED ACTION
Response to Arguments 

1. The applicant has submitted a response, filed 9/30/2004, arguing that the art of
record fails to teach a process of capturing live images relating to the actual utterance to be
presented (amendment pages 8-10), a process of extracting visual features (amendment page 7,
line 7), and an image player coupled to the visual feature extractor presenting an image segment
with the corresponding decoded word (amendment page 10, lines 21-23). However, the
image-capturing step in base claims 1, 10, and 18 does not recite capturing live
images corresponding to one or more words in the utterance. The process of capturing images in
base claims 1, 10, and 18 can capture either prerecorded or live images as long as
images are available at the input of the claimed visual detector. The examiner also respectfully
disagrees with applicant's argument that the prior art of record, specifically Bregler, fails to
disclose a process of extracting image features and an image player coupled to the visual feature
extractor presenting an image segment with the corresponding decoded word. In fact, Bregler
discloses a process of analyzing images to identify visual features such as the speaker's lip
position (col. 5, lines 49-60) and an image player presenting video images (figures 6-7).

2. In response to applicant's argument that the references fail to show certain features of
applicant's invention, it is noted that the features upon which applicant relies (i.e., "repeatedly
presenting" refers to "looping on a time sequence of successive images associated with a
particular word(s) in the utterance" (amendment page 11, lines 10-12)) are not recited in the
rejected claim(s). Although the claims are interpreted in light of the specification, limitations
from the specification are not read into the claims. See In re Van Geuns, 988 F.2d 1181, 26
USPQ2d 1057 (Fed. Cir. 1993). The examiner interprets the limitation to mean that one or more
image segments repeat in a sequential order of video frames displaying movement/motion of
images (figures 6-7 show two different images; any motion picture player employs this
characteristic).

3. As per claims 3 and 12, applicant argues that Bregler fails to teach the step of controlling
the time delay between an image segment and a corresponding decoded word (amendment page 11,
lines 14-21). The examiner respectfully disagrees. Aligning an image segment with a
corresponding decoded word must involve time control, and time control involves
adjusting a time delay in order to realize synchronization (time warping does all of these).

4. In response to applicant's argument that Waters (US 6256046) is nonanalogous art, it has 
been held that a prior art reference must either be in the field of applicant's endeavor or, if not, 
then be reasonably pertinent to the particular problem with which the applicant was concerned, 
in order to be relied upon as a basis for rejection of the claimed invention. See In re Oetiker, 977 
F.2d 1443, 24 USPQ2d 1443 (Fed. Cir. 1992). In this case, Waters is relied upon for the
teaching of measuring the user's position and generating a control signal based on the detected
position of the user.

5. In response to applicant's argument that Liou et al. (US 6580437) is nonanalogous art, it
has been held that a prior art reference must either be in the field of applicant's endeavor or, if
not, then be reasonably pertinent to the particular problem with which the applicant was
concerned, in order to be relied upon as a basis for rejection of the claimed invention. See In re
Oetiker, 977 F.2d 1443, 24 USPQ2d 1443 (Fed. Cir. 1992). In this case, Liou et al. is relied
upon for the teaching of displaying decoded speech text on the display.

6. As per claims 9 and 22, applicant argues that Poggio fails to disclose an image player
displaying image segments in a separate window (amendment page 13, lines 14-28). The examiner
respectfully disagrees. Poggio teaches this limitation in figures 6-9 (see claim rejection).

7. As per claims 13 and 24, applicant argues that both Bregler and Goldenthal fail to
disclose the step of "aligning the plurality of images into one or more image segments according
to the start and stop times received from the ASR system, wherein each image segment
corresponds to a decoded word in the utterance" (amendment page 14, lines 15-27). The
examiner respectfully disagrees. The combination of Bregler and Goldenthal teaches this
limitation (see claim rejection).

Claim Rejections - 35 USC § 102 

8. The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that 
form the basis for the rejections under this section made in this Office action: 

A person shall be entitled to a patent unless - (b) the invention was patented or described in a printed 
publication in this or a foreign country or in public use or on sale in this country, more than one year prior to 
the date of application for patent in the United States. 



9. Claims 1-3, 6, 10-12, 18, 20-21 and 23 are rejected under 35 U.S.C. 102(b) as
being anticipated by Bregler (US Patent No. 5880788).

10. Regarding claim 1, Bregler discloses an apparatus for presenting images
representative of one or more words in an utterance with corresponding decoded
speech, the apparatus comprising:

a visual detector, the visual detector capturing images of body movements
corresponding to one or more words in the utterance (col. 4, ln. 1-31 & col. 7, ln. 1-11);

a visual feature extractor coupled to the visual detector, the visual feature
extractor receiving time information from an automatic speech recognition (ASR) system
and operatively processing the captured images into one or more image segments
based on the time information relating to one or more words, decoded by the ASR
system, in the utterance, each image segment comprising a plurality of successive
images in time corresponding to a decoded word in the utterance (col. 4, ln. 32 to col. 5,
ln. 48, the visual feature extractor is the image analysis S2 in figure 1, which analyzes
movement of the lips. The timing information is interpreted as phonetic information that
is used to process the captured images into one or more image segments); and

an image player operatively coupled to the visual feature extractor, the image 
player receiving and presenting each image segment with the corresponding decoded 
word (figures 6-7, the decoded word can be presented audibly). 

11. Regarding claim 10, Bregler discloses an apparatus for presenting images
representative of one or more words in an utterance with corresponding decoded
speech, the apparatus comprising:

an automatic speech recognition (ASR) engine for converting the utterance into
one or more decoded words, the ASR engine generating time information associated
with each of the decoded words (col. 5, ln. 28-48, phonetic information is the time-stamp
used to link corresponding image segments);

a visual detector, the visual detector capturing images of body movements
corresponding to one or more words in the utterance (col. 4, ln. 1-31 & col. 7, ln. 1-11);

a visual feature extractor coupled to the visual detector, the visual feature
extractor receiving the time information from the ASR engine and operatively processing
the captured images into one or more image segments based on the time information
relating to the decoded words, each image segment comprising a plurality of successive
images in time corresponding to a decoded word in the utterance (col. 4, ln. 32 to col. 5,
ln. 48, the visual feature extractor is the image analysis S2 in figure 1, which analyzes
movement of the lips. The timing information is interpreted as phonetic information that
is used to process the captured images into one or more image segments); and

an image player operatively coupled to the visual feature extractor, the image 
player receiving and presenting each image segment with the corresponding decoded 
word (figures 6-7, the decoded word can be presented audibly). 

12. Regarding claim 18, Bregler discloses that in an automatic speech recognition
(ASR) system for converting an utterance of a speaker into one or more decoded
words, a method for enhancing the ASR system comprising the steps of:

capturing a plurality of successive images in time representing body movements
corresponding to one or more words in the utterance (col. 4, ln. 1-31 & col. 7, ln. 1-11);

associating each of the captured images with time information relating to an
occurrence of the image (col. 5, ln. 28-48, phonetic information is the time information);

obtaining, from the ASR system, time ends for each decoded word in the
utterance (col. 8, ln. 57 to col. 9, ln. 15, phonetic information is the time information);

grouping the plurality of images into one or more image segments based on the
time ends, wherein each image segment corresponds to a decoded word in the
utterance (col. 9, ln. 1-16); and

presenting an image segment with a corresponding decoded word (figures 6-7, 
the decoded word can be presented audibly). 

13. Regarding claims 2 and 11, Bregler further discloses that the image player
repeatedly presents one or more image segments with the corresponding decoded word
(col. 7, ln. 1-54).

14. Regarding claims 3 and 12, Bregler further discloses a delay controller
operatively coupled to the visual feature extractor, the delay controller selectively
controlling a delay between an image segment and a corresponding decoded word in
response to a control signal (col. 9, ln. 48 to col. 10, ln. 57; the use of time warping or
time-scaled modification techniques to realize synchronization provides the delay).

15. Regarding claims 6 and 23, Bregler further discloses that the body movements
include lip movements of the speaker and mouth movements of the speaker (col. 7,
ln. 1-10).



16. Regarding claim 20, Bregler further discloses the step of comparing the time
information relating to the captured images with the time ends for a decoded word (col.
10, ln. 13-30, indicative of comparing timing information and processing audio and
image segments to achieve synchronization), and determining which of the plurality of
images occur within a time interval defined by the time ends of the decoded word (col.
9, ln. 35-62).
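
For illustration only, the comparing and determining steps recited in claim 20 can be
sketched as follows. This is a minimal sketch and is not drawn from Bregler or from the
application: the function name, the frame timestamps, and the word intervals are all
hypothetical, and the interval test is assumed to be inclusive at both time ends.

```python
from typing import List, Tuple


def group_frames_by_word(
    frame_times: List[float],
    word_intervals: List[Tuple[str, float, float]],
) -> List[Tuple[str, List[int]]]:
    """Pair each decoded word with the indices of the frames whose
    timestamps fall inside the word's [start, stop] interval."""
    segments = []
    for word, start, stop in word_intervals:
        # Compare each frame's time information with the word's time ends.
        indices = [i for i, t in enumerate(frame_times) if start <= t <= stop]
        segments.append((word, indices))
    return segments


# Hypothetical data: six frame timestamps (seconds) and two decoded words.
frame_times = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
word_intervals = [("hello", 0.0, 0.2), ("world", 0.3, 0.5)]
segments = group_frames_by_word(frame_times, word_intervals)
```

Each entry of the result pairs a decoded word with the frames that make up its image
segment; a frame that falls outside every interval belongs to no segment.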

17. Regarding claim 21, Bregler further discloses the step of repeatedly displaying
the image segment as an animation of successive images in time (col. 8, ln. 12-44).



Claim Rejections - 35 USC § 103 

18. The following is a quotation of 35 U.S.C. 103(a) which forms the basis for all 
obviousness rejections set forth in this Office action: 

(a) A patent may not be obtained though the invention is not identically disclosed or described as set 
forth in section 102 of this title, if the differences between the subject matter sought to be patented and 
the prior art are such that the subject matter as a whole would have been obvious at the time the 
invention was made to a person having ordinary skill in the art to which said subject matter pertains. 
Patentability shall not be negatived by the manner in which the invention was made. 



19. Claims 4-5 are rejected under 35 U.S.C. 103(a) as being unpatentable over
Bregler (US Patent No. 5880788) in view of Waters et al. (US Patent No. 6256046).

20. Claim 4 is dependent on claim 1, which is anticipated by Bregler et al. for the
reasons noted above in the 102(b) rejection.

21. Regarding claim 4, Bregler further discloses a visual detector for monitoring a
position of a user (col. 4, ln. 1-31 & col. 7, ln. 1-11).

Bregler does not disclose a position detector coupled to the visual detector, the 
position detector comparing the position of the user with a reference position and 
generating a control signal, the control signal being a first value when the position of the 
user is within the reference area and being a second value when the position of the user
is not within the reference area; and a label generator coupled to the position detector, 
the label generator displaying a visual indication on a display in response to the control 
signal from the position detector. 

However, Waters et al. teach a position detector coupled to the visual detector,
the position detector comparing the position of the user with a reference position and
generating a control signal, the control signal being a first value when the position of the
user is within the reference area and being a second value when the position of the user
is not within the reference area (col. 4, ln. 20-41); and a label generator coupled to the
position detector, the label generator displaying a visual indication on a display in
response to the control signal from the position detector (col. 5, ln. 28-59). The
advantage of using the teaching of Waters et al. in Bregler is to provide automated
information to users in public places without human intervention.

Because Bregler and Waters et al. are analogous art from the same field of
endeavor, it would have been obvious to one of ordinary skill in the art at
the time of invention to modify Bregler by incorporating the teaching of Waters et al. in
order to detect the presence of users so that the system provides automated
information to users in public places without human intervention.
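
For illustration only, the position detector and label generator limitations of claim 4
can be sketched as follows. This is a minimal sketch and is not drawn from Waters et al.
or from the application: the rectangular reference area, the signal values, and the
label strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferenceArea:
    """Hypothetical rectangular reference area in image coordinates."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


# Assumed encodings for the first and second values of the control signal.
IN_AREA = 1
OUT_OF_AREA = 0


def control_signal(x: float, y: float, area: ReferenceArea) -> int:
    """Compare the user's position with the reference area."""
    return IN_AREA if area.contains(x, y) else OUT_OF_AREA


def visual_indication(signal: int) -> str:
    """Choose the label to display in response to the control signal."""
    return "in position" if signal == IN_AREA else "out of position"


area = ReferenceArea(left=0.0, top=0.0, right=1.0, bottom=1.0)
inside = control_signal(0.5, 0.5, area)
outside = control_signal(1.5, 0.5, area)
```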

22. Regarding claim 5, Bregler further discloses that the label generator receives
information from the ASR system, the label generator using the information from the
ASR system to operatively position the visual indication on the display (col. 10, ln. 13-30;
phonetic information is used to synchronize audio and image data before display on
the screen, figures 6-7).

23. Claims 7-8 are rejected under 35 U.S.C. 103(a) as being unpatentable over 
Bregler (US Patent No. 5880788) in view of Liou et al. (US Patent No. 6580437). 

24. Claim 7 is dependent on claim 1, which is anticipated by Bregler et al. for the
reasons noted above in the 102(b) rejection.

25. Regarding claim 7, Bregler further discloses a display controller, the display
controller selectively controlling one or more characteristics of a manner in which the
image segments are displayed with the corresponding audio played (col. 10, ln. 13-30).
Bregler fails to specifically disclose that the image segments are displayed with
corresponding decoded speech text. However, Liou et al. teach that the image
segments are displayed with corresponding decoded speech text (figure 9). The
advantage of using the teaching of Liou et al. in Bregler is to provide subtext for the
hearing impaired individuals.

Because Bregler and Liou et al. are analogous art from the same field of
endeavor, it would have been obvious to one of ordinary skill in the art at the
time of invention to modify Bregler by incorporating the teaching of Liou et al. in order to
provide subtext for the hearing impaired individuals.

26. Regarding claim 8, Bregler further discloses that the display controller operatively
controls the position of an image segment on the display (col. 7, ln. 31-40).

27. Claims 9 and 22 are rejected under 35 U.S.C. 103(a) as being unpatentable over 
Bregler (US Patent No. 5880788) in view of Poggio et al. (US Patent No. 6250928) and 
further in view of Liou et al. (US Patent No. 6580437). 

28. Claims 9 and 22 are dependent on claims 1 and 21, respectively, which are
anticipated by Bregler et al. for the reasons noted above in the 102(b) rejection.

29. Regarding claims 9 and 22, Bregler does not disclose that the image player
displays each image segment in a separate window on a display in close proximity to
the decoded speech text corresponding to the image segment. However, Poggio et al.
teach that the image player displays each image segment in a separate window (col.
10, ln. 52-67 or figure 8). The advantage of using the teaching of Poggio et al. in
Bregler is to capture viseme transitions to enable engineers to study the mouth shapes
created by pronouncing each particular phoneme.

Because Bregler and Poggio et al. are analogous art from the same field of
endeavor, it would have been obvious to one of ordinary skill in the art at
the time of invention to modify Bregler by incorporating the teaching of Poggio et al. in
order to capture viseme transitions to enable engineers to study the mouth shapes created
by pronouncing each particular phoneme.

The modified Bregler still does not disclose that each image segment is 
displayed in close proximity to the decoded speech text. However, Liou et al. further 
teach that each image segment is displayed in close proximity to the decoded speech 
text (figure 9). The advantage of using the teaching of Liou et al. in Bregler is to provide 
subtext for the hearing impaired individuals. 

Because Bregler and Liou et al. are analogous art from the same field of
endeavor, it would have been obvious to one of ordinary skill in the art at the
time of invention to modify Bregler by incorporating the teaching of Liou et al. in order to
provide subtext for the hearing impaired individuals.

30. Claims 13-15, 17, 19, and 24 are rejected under 35 U.S.C. 103(a) as being
unpatentable over Bregler (US Patent No. 5880788) in view of Goldenthal et al. (US
Patent No. 5884267).

31. Regarding claim 13, Bregler discloses a method for presenting images
representative of one or more words in an utterance with corresponding decoded
speech, the method comprising the steps of:

capturing a plurality of images representing body movements corresponding to
the one or more words in the utterance (col. 4, ln. 1-31 & col. 7, ln. 1-11);

associating each of the captured images with time information relating to an
occurrence of the image (col. 5, ln. 28-48, phonetic information is the time information);

receiving, from an automatic speech recognition (ASR) system, phonetic
information of a word decoded by the ASR system (col. 8, ln. 57 to col. 9, ln. 15,
phonetic information is the time information);

aligning the plurality of images into one or more image segments according to
the phonetic information received from the ASR system, wherein each image segment
corresponds to a decoded word in the utterance (col. 9, ln. 1-16); and

presenting an image segment with a corresponding decoded word (figures 6-7, 
the decoded word can be presented audibly). 

Bregler does not disclose a start time and an end time generated by the ASR
system. However, Goldenthal et al. teach a start time and an end time generated by the
ASR system (col. 4, ln. 14-22). The advantage of using the teaching of Goldenthal et al.
in Bregler is to provide a means for synchronizing the audio and image segments.

Because Bregler and Goldenthal et al. are analogous art from the same field of
endeavor, it would have been obvious to one of ordinary skill in the art at
the time of invention to modify Bregler by incorporating the teaching of Goldenthal et al.
in order to provide a means for synchronizing the audio and image segments.

32. Regarding claim 24, Bregler discloses a method for presenting images 
representative of one or more words in an utterance with corresponding decoded 
speech, the method comprising the steps of: 

providing an automatic speech recognition (ASR) engine (figure 2); 

decoding, in the ASR engine, the utterance into one or more words, each of the
decoded words having phonetic information associated therewith (col. 8, ln. 57 to col. 9,
ln. 15, phonetic information is the time information);

decoded words having a start time and a stop time associated therewith (); 

capturing a plurality of images representing body movements corresponding to
the one or more words in the utterance (col. 4, ln. 1-31 & col. 7, ln. 1-11);

buffering the plurality of images by a predetermined delay (col. 9, ln. 48 to col.
10, ln. 57; the use of time warping or time-scaled modification techniques to realize
synchronization provides the delay);

receiving, from the ASR engine, data including the phonetic information of a
decoded word (col. 8, ln. 57 to col. 9, ln. 15, phonetic information is the time information);

aligning the plurality of images into one or more image segments according to
the phonetic information received from the ASR engine, wherein each image segment
corresponds to a decoded word in the utterance (col. 9, ln. 1-16); and

presenting an image segment with a corresponding decoded word (figures 6-7,
the decoded word can be presented audibly).

Bregler does not disclose a start time and an end time generated by the ASR
system. However, Goldenthal et al. teach a start time and an end time generated by the
ASR system (col. 4, ln. 14-22). The advantage of using the teaching of Goldenthal et al.
in Bregler is to provide a means for synchronizing the audio and image segments.

Because Bregler and Goldenthal et al. are analogous art from the same field of
endeavor, it would have been obvious to one of ordinary skill in the art at
the time of invention to modify Bregler by incorporating the teaching of Goldenthal et al.
in order to provide a means for synchronizing the audio and image segments.

33. Regarding claims 14-15, Bregler further discloses the steps of selectively
controlling a delay between when an image segment is presented and when a decoded
word corresponding to the image segment is presented and selectively controlling a
manner in which an image segment is presented with a corresponding decoded word
(col. 9, ln. 48 to col. 10, ln. 57, by utilizing time warping or time-scaled modification
techniques to realize synchronization).
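
For illustration only, a selectively controlled delay between the presentation of a
decoded word and the presentation of its image segment can be sketched as follows. This
is a minimal sketch under stated assumptions and does not reproduce the time warping or
time-scaled modification of Bregler: the function name, the millisecond time base, and
the sample words are hypothetical, and the delay is assumed to be a single fixed offset.

```python
from typing import List, Tuple


def schedule_presentation(
    word_times_ms: List[Tuple[str, int]],
    delay_ms: int,
) -> List[Tuple[str, int, int]]:
    """Return (word, word_time_ms, segment_time_ms) triples in which each
    image segment is presented delay_ms after its decoded word."""
    return [(word, t, t + delay_ms) for word, t in word_times_ms]


# Hypothetical presentation times for two decoded words, in milliseconds.
word_times_ms = [("hello", 0), ("world", 300)]
schedule = schedule_presentation(word_times_ms, delay_ms=150)
```

Setting delay_ms to zero presents each image segment at the same time as its word; a
control signal could change the value between calls.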

34. Regarding claim 17, Bregler further discloses that the step of aligning the plurality
of images comprises: comparing the time information relating to the captured images
with the start and stop times for a decoded word (col. 8, ln. 57 to col. 9, ln. 16); and
determining which of the plurality of images occur within a time interval defined by the
start and stop times of the decoded word (col. 9, ln. 7-76).

35. Regarding claim 19, Bregler does not disclose that the step of obtaining time ends for
a decoded word from the ASR system comprises determining a start time and a stop
time associated with the decoded word. However, Goldenthal et al. teach the step of
determining a start time and a stop time associated with the decoded word (col. 4, ln.
14-22). The advantage of using the teaching of Goldenthal et al. in Bregler is to provide
a means to synchronize the audio signal to the corresponding images.

Because Bregler and Goldenthal et al. are analogous art from the same field of
endeavor, it would have been obvious to one of ordinary skill in the art at
the time of invention to modify Bregler by incorporating the teaching of Goldenthal et al.
in order to provide a means to synchronize the audio signal to the corresponding images.

36. Claim 16 is rejected under 35 U.S.C. 103(a) as being unpatentable over Bregler 
(US Patent No. 5880788) in view of Goldenthal et al. (US Patent No. 5884267) as 
applied to claim 13 above, and further in view of Waters et al. (US Patent No. 6256046). 

37. Regarding claim 16, Bregler further discloses a visual detector for monitoring a
position of a user (col. 4, ln. 1-31 & col. 7, ln. 1-11).

Bregler does not disclose a position detector coupled to the visual detector, the
position detector comparing the position of the user with a reference position and
generating a control signal, the control signal being a first value when the position of the
user is within the reference area and being a second value when the position of the user
is not within the reference area; and a label generator coupled to the position detector,
the label generator displaying a visual indication on a display in response to the control
signal from the position detector.

However, Waters et al. teach a position detector coupled to the visual detector,
the position detector comparing the position of the user with a reference position and
generating a control signal, the control signal being a first value when the position of the
user is within the reference area and being a second value when the position of the user
is not within the reference area (col. 4, ln. 20-41); and a label generator coupled to the
position detector, the label generator displaying a visual indication on a display in
response to the control signal from the position detector (col. 5, ln. 28-59). The
advantage of using the teaching of Waters et al. in Bregler is to provide automated
information to users in public places without human intervention.

Because Bregler and Waters et al. are analogous art from the same field of
endeavor, it would have been obvious to one of ordinary skill in the art at
the time of invention to modify Bregler by incorporating the teaching of Waters et al. in
order to detect the presence of users so that the system provides automated
information to users in public places without human intervention.



Conclusion

THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time
policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE 
MONTHS from the mailing date of this action. In the event a first reply is filed within 
TWO MONTHS of the mailing date of this final action and the advisory action is not 
mailed until after the end of the THREE-MONTH shortened statutory period, then the 
shortened statutory period will expire on the date the advisory action is mailed, and any 
extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of 
the advisory action. In no event, however, will the statutory period for reply expire later 
than SIX MONTHS from the mailing date of this final action. 

The prior art made of record and not relied upon is considered pertinent to 
applicant's disclosure. Braida et al. (US Patent No. 6317716) teach a method for 
automatically cueing speech that is considered pertinent to the claimed invention. 

Any inquiry concerning this communication or earlier communications from the 
examiner should be directed to Huyen Vo whose telephone number is 703-305-8665. 
The examiner can normally be reached on M-F, 9-5:30. 

If attempts to reach the examiner by telephone are unsuccessful, the examiner's
supervisor, Doris To, can be reached on 703-305-4827. The fax phone number for the
organization where this application or proceeding is assigned is 703-872-9306.

Information regarding the status of an application may be obtained from the
Patent Application Information Retrieval (PAIR) system. Status information for
published applications may be obtained from either Private PAIR or Public PAIR.
Status information for unpublished applications is available through Private PAIR only.
For more information about the PAIR system, see http://pair-direct.uspto.gov. Should
you have questions on access to the Private PAIR system, contact the Electronic
Business Center (EBC) at 866-217-9197 (toll-free).

Huyen X. Vo 
February 15, 2005